$h_{k+1} = h_k + \frac{1}{k^{\chi}}(b_k - A_k h_k)$ are studied, where $\chi \in (0,1)$, $\{A_k\}_{k=1}^{\infty}$ are symmetric, positive semidefinite random matrices and $\{b_k\}_{k=1}^{\infty}$ are random vectors. It is shown that $|h_n - A^{-1}b| = o(n^{-\gamma})$ a.s. for the $\gamma \in [0,\chi)$, positive definite $A$ and vector $b$ such that

$\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(A_k - A) \to 0$ and $\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(b_k - b) \to 0$ a.s.

When $\chi - \gamma \in (\frac{1}{2}, 1)$, these assumptions are implied by the Marcinkiewicz strong law of large numbers, which allows the $\{A_k\}$ and $\{b_k\}$ to have heavy tails, long-range dependence or both. Finally, corroborating experimental outcomes and decreasing-gain design considerations are provided.

1. Introduction

Linear stochastic approximation algorithms have found widespread application in parameter estimation, adaptive machine learning, signal processing, econometrics and pattern recognition (see, e.g., [1], [3], [9], [25] and [29]). Consequently, their asymptotic rates of almost sure and $r^{\text{th}}$-mean convergence as well as invariance and large deviation principles are of utmost importance (see e.g., [6], [11], [17], [18], [21], [22], [23], [31] and [32]). For motivation, suppose $\{x_k, k = 1, 2, \ldots\}$ and $\{y_k, k = 2, 3, \ldots\}$ are second order $\mathbb{R}^d$- and $\mathbb{R}$-valued stochastic processes, defined on some probability space $(\Omega, \mathcal{F}, P)$, that satisfy

$y_{k+1} = x_k^T h + \epsilon_k$, $\forall k = 1, 2, \ldots$, (1)

where $h$ is an unknown $d$-dimensional parameter or weight vector of interest and $\epsilon_k$ is a noise sequence. One often wants to find the value of $h$ that minimizes the mean-square error $h \mapsto E|y_{k+1} - x_k^T h|^2$. This best $h$ is given by $h = A^{-1}b$, where $A = E(x_k x_k^T)$ and $b = E(y_{k+1} x_k)$, assuming the expectations exist, wide-sense stationarity conditions hold and $A$ is positive definite. However, we often do not know the joint distribution of $(x_k, y_{k+1})$ nor have the necessary stationarity but instead estimate $h$ using a linear algorithm of the form:


$h_{k+1} = h_k + \mu_k(b_k - A_k h_k)$, (2)

where $\mu_k$ is the $k^{\text{th}}$ step size (often of the form $\mu_k = k^{-\chi}$ for some $\chi \in (\frac{1}{2}, 1]$) and

$A_k = \frac{1}{N}\sum_{l=\max\{k-N+1,1\}}^{k} x_l x_l^T$ and $b_k = \frac{1}{N}\sum_{l=\max\{k-N+1,1\}}^{k} y_{l+1} x_l$ (3)

for some $N \in \mathbb{N}$, are random sequences of symmetric, positive-semi-definite matrices and vectors respectively. Most often $N = 1$ so $A_k = x_k x_k^T$ and $b_k = y_{k+1} x_k$. More information on stochastic approximation can be found in e.g. [8], [10], [13], [17], [24], [28] and [33], which provide examples and motivation for our work. However, our work is easily differentiated from these. Delyon [8], for example, focuses on non-linear stochastic approximation algorithms, treating linear examples the same as non-linear ones. (In Section 4.2.2 he uses linear algorithm approximation but with a constant deterministic matrix $A_k = A$ in our notation.) Delyon’s work handles important applications. However, his A-stable and (A, B) Conditions are usually harder to verify than our Marcinkiewicz Strong Law of Large Numbers (MSLLN) conditions (given below) in the (unbounded, random $A_k$) linear case. Moreover, he does not supply almost sure rates of convergence, his theorems are geared to martingale-increment-plus-decreasing-perturbation noise and he often assumes fourth order moments. We are motivated by (but not restricted to) the common setting where $X_k = (x_k^T, y_{k+1})^T$ is a (multivariate) linear process

$X_k = \sum_{l=-\infty}^{\infty} C_{k-l}\,\Xi_l$. (4)

Matrix sequence $(C_l)$ can decay slowly enough (as $|l| \to \infty$) for long-range dependence (LRD) while $\{\Xi_l\}$ can have heavy tails (HT), so $E|b_k|^2 = \infty$ and/or $E|A_k|^2 = \infty$. Even in the lighter tail, short-range dependence case our two-sided linear process example $\{X_k\}$ is not a martingale. Moreover, long-range dependence and heavy tails, exhibited in many network [19], financial and paleoclimatic data sets for example, void the usual mixing and moment conditions. We focus on one-step versus Polyak-Ruppert’s two-step averaging algorithms but handle heavy tails and long range dependence, deriving a surprising decoupling. This means that the optimal convergence rate of (2) is affected by either the heavy tails or the long-range dependence, whichever is worse, but not both. This contrasts with the rate for partial sums of long-range dependent, heavily-tailed random variables, which is degraded twice (see e.g. Theorem 4).
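
The recursion (2) with $N = 1$ can be sketched in a few lines. The following is only a minimal illustration, not the simulation code used for this paper: the function name `run_lsa`, the bounded uniform regressors, the uniform noise, the seed and the starting point are all assumptions chosen to keep the example small and light-tailed.

```python
import random


def run_lsa(n_steps, chi, h_true=(1.0, 1.0), h_init=(5.0, -3.0), seed=0):
    """Iterate h_{k+1} = h_k + k^{-chi} (b_k - A_k h_k) in dimension d = 2.

    Here A_k = x_k x_k^T and b_k = y_{k+1} x_k, so that
    b_k - A_k h_k = x_k (y_{k+1} - x_k^T h_k).
    """
    rng = random.Random(seed)  # illustrative seed (assumption)
    h = list(h_init)
    for k in range(1, n_steps + 1):
        # Assumed light-tailed model: uniform regressors and uniform noise.
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        noise = rng.uniform(-0.1, 0.1)
        y_next = x[0] * h_true[0] + x[1] * h_true[1] + noise
        mu = k ** (-chi)  # decreasing gain mu_k = k^{-chi}
        residual = y_next - (x[0] * h[0] + x[1] * h[1])
        h = [h[i] + mu * residual * x[i] for i in range(2)]
    return h


h_final = run_lsa(20000, chi=0.75)
print(h_final)  # expected to be close to h_true = (1, 1)
```

In this assumed model $A = E(x_k x_k^T) = \frac{1}{3}I$ is positive definite, so the iterates should approach $h = A^{-1}b = (1,1)^T$; heavy-tailed or long-range dependent inputs would replace the two uniform draws.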

Step size $\mu_k$ has a direct effect on the convergence rate and algorithm effectiveness (see, e.g. [12], [15] and references cited therein). Consider the extreme cases. In the homogeneous, deterministic setting, i.e. $A_k = A$ and $b_k = b$, (2) can solve the linear equation $Ah = b$ when matrix inversion of $A$ is ill-conditioned. In this case, a constant gain $\mu_k = \varepsilon$ is best: Since $b = Ah$, we have $h_{k+1} = h_k - \varepsilon A(h_k - h)$, so $h_n - h = (I - \varepsilon A)^{n-1}(h_1 - h)$ and $h_n \to h$ geometrically, provided $\varepsilon$ is small enough that the eigenvalues of $I - \varepsilon A$ are within the unit disc. Conversely, in the presence of persistent noise, decreasing step sizes are required for the convergence $h_n \to h$. Existing results show that the best possible almost-sure rate of convergence is $|h_n - h| = O\left(\sqrt{\frac{\log\log(n)}{n}}\right)$, implied by the law of the iterated logarithm, and that this rate is only attainable when $\mu_k = \frac{1}{k}$, second moments of $A_k, b_k$ exist and there is no long-range dependence. (These claims follow from the almost-sure invariance principle in Kouritzin [23].)
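
The geometric convergence in the deterministic, constant-gain case can be checked numerically. The sketch below is illustrative only: the $2 \times 2$ matrix, the target vector and the gain $\varepsilon = 0.5$ are assumed values, chosen so that the eigenvalues of $I - \varepsilon A$ are $\pm 0.5$.

```python
def constant_gain_iterates(A, b, h_start, eps, n_steps):
    """Iterate h_{k+1} = h_k + eps * (b - A h_k) for a fixed 2x2 matrix A."""
    h = list(h_start)
    for _ in range(n_steps):
        residual = [b[i] - (A[i][0] * h[0] + A[i][1] * h[1]) for i in range(2)]
        h = [h[i] + eps * residual[i] for i in range(2)]
    return h


# Assumed example: A is symmetric positive definite with eigenvalues 1 and 3.
A = [[2.0, 1.0], [1.0, 2.0]]
h_true = [1.0, -2.0]
b = [0.0, -3.0]  # b = A h_true
h_60 = constant_gain_iterates(A, b, [0.0, 0.0], eps=0.5, n_steps=60)
print(h_60)  # the error shrinks like 0.5^n, so this is essentially h_true
```

With a gain large enough that an eigenvalue of $I - \varepsilon A$ leaves the unit disc (here any $\varepsilon > 2/3$), the same loop diverges.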

Herein, we handle all gains, long range dependence and heavy tails, addressing the optimal rate of convergence by establishing results akin to the MSLLN, namely $|h_n - h| = o(n^{-\gamma})$ for all $\gamma < \gamma_0^{\chi} = \chi - M$. $M$ is called the Marcinkiewicz threshold in the sequel and is defined by

$M = \inf\left\{\frac{1}{m} : \lim_{n\to\infty}\frac{1}{n^{1/m}}\sum_{k=1}^{n}(A_k - A) = 0,\ \lim_{n\to\infty}\frac{1}{n^{1/m}}\sum_{k=1}^{n}(b_k - b) = 0 \text{ a.s.}\right\}$. (5)

Usually, we expect $M \in (\frac{1}{2}, 1]$, due to the Strong Law of Large Numbers and the Central Limit Theorem in the light-tail, short-range-dependence case, but when there is LRD and/or HT $M$ generally cannot approach $\frac{1}{2}$. When $\{(x_k^T, y_{k+1})^T : k \in \mathbb{Z}\}$ is a linear process as in (4), it is shown in [20] that $M = \frac{1}{\alpha} \vee (2 - 2\sigma)$ with $\alpha = \sup\{a < 2 : \sup_{t \ge 0} t^{a} P(|\Xi_1|^2 > t) < \infty\}$ and $\sigma = \sup\{s \in (\frac{1}{2}, 1] : \sup_{l \in \mathbb{Z}} |l|^{s}\|C_l\| < \infty\}$. Hence, $\gamma < \gamma_0^{\chi} = (\chi - \frac{1}{\alpha}) \wedge (\chi + 2\sigma - 2)$. Here, $\alpha \in (1, 2]$ is a heavy-tail parameter with $\alpha = 2$ indicating non-heavy tails and $\sigma \in (\frac{1}{2}, 1]$ is a long-range dependence parameter with $\sigma = 1$ indicating the minimal amount of long-range dependence.
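
The threshold and rate bound for the linear-process case are simple enough to tabulate directly. The helper names below (`marcinkiewicz_threshold`, `rate_bound`) are hypothetical and only restate the formulas $M = \frac{1}{\alpha} \vee (2 - 2\sigma)$ and $\gamma_0^{\chi} = (\chi - \frac{1}{\alpha}) \wedge (\chi + 2\sigma - 2)$ quoted from [20].

```python
def marcinkiewicz_threshold(alpha, sigma):
    """M = (1/alpha) v (2 - 2*sigma) for alpha in (1, 2], sigma in (1/2, 1]."""
    return max(1.0 / alpha, 2.0 - 2.0 * sigma)


def rate_bound(chi, alpha, sigma):
    """gamma_0 = (chi - 1/alpha) ^ (chi + 2*sigma - 2), i.e. chi - M."""
    return min(chi - 1.0 / alpha, chi + 2.0 * sigma - 2.0)


# Only the worse of the two effects matters (the decoupling):
print(marcinkiewicz_threshold(alpha=1.5, sigma=0.65))  # LRD dominates here
print(marcinkiewicz_threshold(alpha=1.25, sigma=0.9))  # heavy tails dominate here
print(rate_bound(chi=0.85, alpha=1.5, sigma=0.65))
```

A non-positive value of `rate_bound` signals that $\chi \le M$, where the results here do not guarantee convergence.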

In classical applications the best theoretical convergence rate is attained when $\chi = 1$ corresponding to $\gamma_0^{\chi} = \frac{1}{2}$. However, this rate knowledge can lead to erroneous conclusions as the algorithm often performs better with $\mu_k = k^{-\chi}$ for some $\chi < 1$ than with $\mu_k = \frac{1}{k}$. How might one explain this apparent paradox? First of all, these simple rate-of-convergence results do not account for the possibility of exploding constants, i.e. if $h_n^{(\chi)}$ denotes the solution of the algorithm (2) with $\mu_k = k^{-\chi}$ then $|h_n^{(\chi)} - h| = D_{\chi} n^{-\gamma^{(\chi)}}$ for all $\gamma^{(\chi)} < \gamma_0^{(\chi)}$. However, this $D_{\chi}$ often increases rapidly as $\chi \nearrow 1$ so the observed convergence may be fastest for some $\chi < 1$. Secondly, a higher value of $\chi$ is worse for forgetting a poor initial guess $h_0$ of $h$ since you move further and further from the geometric convergence mentioned above as $\chi \to 1$.

Our approach is to transfer the MSLLN from the partial sums of a linear algorithm’s coefficients to its solution. In other words, we establish the almost sure rates of convergence $|h_n - h| = o(n^{-\gamma})$ for the algorithm

$h_{k+1} = h_k + \frac{1}{k^{\chi}}(b_k - A_k h_k)$ $\forall k = 1, 2, 3, \ldots$ (6)

with $\chi \in (0, 1)$, assuming only

$\lim_{n\to\infty}\frac{1}{n^{\chi}}\sum_{k=1}^{n}(A_k - A) = 0$ and $\lim_{n\to\infty}\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(b_k - A_k h) = 0$ a.s. (7)

for some $\gamma \in [0, \chi)$, which can be implied by e.g.

$\lim_{n\to\infty}\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(A_k - A) = 0$ and $\lim_{n\to\infty}\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(b_k - b) = 0$ a.s., (8)

where $Ah = b$. When $\chi - \gamma \in (\frac{1}{2}, 1]$, these conditions can be verified by the MSLLN under a variety of conditions, which we study using the specific structure of $A_k$ and $b_k$ in Section 3.

In addition to rates of convergence, our results show that convergence ($h_k \to h$) in (6) takes place provided that $\chi \in (M, 1)$. All this suggests that more quickly decreasing gains like $\mu_k = k^{-\chi}$ with $\chi$ near 1 should be used in very heavy-tailed or long-range dependent settings. Conversely, slowly decreasing gains like $\mu_k = k^{-\chi}$ with smaller $\chi$ might work well in lighter-tailed, short-range-dependent situations. Our simulations in Section 4 show that the smallest normalized error, $\frac{|h_n - h|}{|h_1 - h|}$, usually occurs for $\chi \in (M, 1]$ and the most commonly used choice $\chi = 1$ is most appropriate in very heavy-tailed or long-range-dependent settings (where $M$ is close to 1) or very long runs. In other words, a slower decreasing gain usually gets you close to the true parameters $h$ more quickly unless the coefficients have a high probability of differing significantly from their means.

Let us consider what is new in terms of our theoretical results. The idea of inferring convergence and rates of convergence results for linear algorithms (2) from like convergence and rates of convergence of its coefficients is not new. Indeed, it dates back at least to work done by one of the authors in 1992 and 1993 (see [21], [22] and [23]). The first result [21] considered relatively general gain $\mu_k$ and achieved optimal rates of $r^{\text{th}}$-mean convergence. It has been proved in [23] that the solution of the linear algorithm (2) satisfies an almost sure invariance principle with respect to a limiting Gaussian process when $\mu_k = \frac{1}{k}$ and each $A_k$ is symmetric under the minimal condition that the coefficients satisfy such an a.s. invariance principle. One could then immediately transfer functional laws of the iterated logarithm from the limiting Gaussian process back to the solution of the linear algorithm. Again assuming the “usual” conditions of $A_k$ symmetry and $\mu_k = \frac{1}{k}$, Kouritzin [22] showed that the solution of the linear algorithm converges almost surely given that the coefficients do. While this result does not state rates of convergence, our current work in going from Proposition 1 to Theorem 1 herein shows that almost-sure rates of convergence sometimes follow from convergence results for linear algorithms as a simple corollary.

There were many results (see, e.g. [13], [14] and [16]) that preceded those mentioned above and gave convergence or rates of convergence for linear algorithms. However, these results assumed a specific dependency structure and, thereby, were not generally applicable. More recently, some authors, e.g. [6], [8] and [31], have followed the path of transferring convergence and rates of convergence from partial sums of (the coefficient) random variables to the solutions of linear equations. Specifically, Tadic [31] transferred almost-sure rates of convergence, including those of the law-of-the-iterated-logarithm rate, from the coefficients to the linear algorithm in the non-symmetric-$A_k$, general-gain case. He does not develop a law of the iterated logarithm where one characterizes the limit points nor does he consider functional versions. Moreover, he imposes one of two sets of conditions (A and B in his notation). Conditions B ensure the gain $\mu_k \approx \frac{1}{k}$, so these results should be compared to prior results in [23] and [4], which imply stronger Strassen-type functional laws of the iterated logarithm. Tadic does not give any examples verifying his Conditions A where lesser rates are obtained.

It seems that we are the first to consider processes that are simultaneously 
heavy-tailed and long-range dependent in stochastic approximation. 

The rest of this paper is organized as follows. The main theorems are formulated in Section 2. Then, Section 3 includes some background about the Marcinkiewicz Strong Law of Large Numbers for Partial Sums and a new MSLLN result for outer products of multivariate linear processes with LRD and HT. Experimental results are given in Section 4 and the proof of the main result (Theorem 1) is delayed until Section 5.


2. Notation and Theoretical Result 

In this section, we define our notation and provide our results. 

2.1. Notation List 

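$|x|$ is the Euclidean norm of an $\mathbb{R}^d$-vector $x$.

$\|C\| = \sup_{|x|=1}|Cx|$ for any $\mathbb{R}^{n\times m}$-matrix $C$.

$|||A|||^2 = \sum_{n=1}^{d}\sum_{o=1}^{d}(A^{(n,o)})^2$, where $A^{(n,o)}$ is the $(n,o)^{\text{th}}$ component of $A \in \mathbb{R}^{d\times d}$.

$\lfloor t \rfloor = \max\{i \in \mathbb{N}_0 : i \le t\}$ and $\lceil t \rceil = \min\{i \in \mathbb{N}_0 : i \ge t\}$ for any $t \ge 0$.

$a_{i,k} \overset{i}{\ll} b_{i,k}$ means that for each $k$ there is a $c_k > 0$ that does not depend upon $i$ such that $|a_{i,k}| \le c_k |b_{i,k}|$ for all $i, k$.

$\prod_{l=p}^{q} B_l$ (the $B_l$ being $\mathbb{R}^{d\times d}$-matrices) $= B_q B_{q-1} \cdots B_p$ if $q \ge p$ or $I$ if $p > q$.

$a \vee b = \max\{a, b\}$ and $a \wedge b = \min\{a, b\}$.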

2.2. Main Results 

We will state and prove our results in a completely deterministic manner and then apply these results on a sample path by sample path basis. Therefore, we assume that $\chi \in (0, 1)$, $d$ is a positive integer, $\{A_k\}_{k=1}^{\infty}$ is a symmetric, positive semidefinite $\mathbb{R}^{d\times d}$-valued sequence, $\{b_k\}_{k=1}^{\infty}$ is an $\mathbb{R}^d$-valued sequence and $\{h_k\}_{k=1}^{\infty}$ is an $\mathbb{R}^d$-valued sequence satisfying:

$h_{k+1} = h_k + \frac{1}{k^{\chi}}(b_k - A_k h_k)$ for all $k = 1, 2, \ldots$ (9)


Our first main result establishes rates of almost sure convergence: 

Theorem 1 Suppose $\gamma \in [0, \chi)$, $h \in \mathbb{R}^d$ and $A$ is a symmetric positive-definite matrix.

a) If

$\lim_{n\to\infty}\frac{1}{n^{\chi}}\sum_{k=1}^{n}(A_k - A) = 0$, (10)

and

$\lim_{n\to\infty}\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(b_k - A_k h) = 0$, (11)

then $|h_n - h| = o(n^{-\gamma})$ as $n \to \infty$.


b) Conversely, lim 

n—>-00 


nX 

fc=l 


0 , */ lim Ifc 1 x (h k — h)\ =0 and 

k—t 00 


1 


E**~‘ 

fc=l 



zs bounded in n. 


( 12 ) 


Remark 1 Lemma 1 of the Appendix establishes that (10) implies (12).

Theorem 1 with $A_k = A_k(\omega)$ and $b_k = b_k(\omega)$ for all $k$, implies $h_n(\omega)$, the solution of (2), converges to $h = A^{-1}b$ a.s. Indeed, to establish the rate of convergence $|h_n - h| = o(n^{-\gamma})$, one need only check standard conditions for the MSLLN in (10) and (11), which is a less onerous task than checking the technical conditions in Corollary 1 or Corollary 3 in [31] say. Indeed, there appears to be a need for some extra stability in [31] by the imposition that “the real parts of the eigenvalues of $A$ should be strictly less than a certain negative value depending on the asymptotic properties of $\{\gamma_n\}$ and $\{\delta_n\}$”. We do not need any such extra condition.

Generally, we do not know $h$ when using stochastic approximation so we cannot just verify (11) but rather use the following corollary instead of Theorem 1.

Corollary 1 Suppose $\gamma \in [0, \chi)$, $A$ is a symmetric positive-definite matrix and

$\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(b_k - b) \to 0$ and $\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(A_k - A) \to 0$ a.s. (13)

Then, $|h_n - h| = o(n^{-\gamma})$ as $n \to \infty$.

Finally, we give a version of the theorem for linear processes under readily verifiable conditions.

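Theorem 2 Let $\{\Xi_l\}$ be i.i.d. zero-mean random $\mathbb{R}^m$-vectors such that

$\sup_{t \ge 0} t^{\alpha} P(|\Xi_1|^2 > t) < \infty$ for some $\alpha \in (1, 2)$,

$(C_l)_{l\in\mathbb{Z}}$ be $\mathbb{R}^{(d+1)\times m}$-matrices such that $\sup_{l\in\mathbb{Z}} |l|^{\sigma}\|C_l\| < \infty$ for some $\sigma \in (\frac{1}{2}, 1]$,

$(x_k^T, y_{k+1})^T = \sum_{l=-\infty}^{\infty} C_{k-l}\,\Xi_l$,

$A_k = x_k x_k^T$, $b_k = y_{k+1} x_k$ and $A = E[x_k x_k^T]$ and $b = E[y_{k+1} x_k]$.

Then, $|h_n - h| = o(n^{-\gamma})$ as $n \to \infty$ a.s. for any $\gamma < \gamma_0^{\chi} = (\chi - \frac{1}{\alpha}) \wedge (\chi + 2\sigma - 2)$.

Remark 2 Theorem 2 follows from Corollary 1 and Theorem 6 (to follow), by letting $\frac{1}{p} = \chi - \gamma$ and $X_k = \bar{X}_k = (x_k^T, y_{k+1})^T$ and correspondingly, $\Xi_l = \bar{\Xi}_l$, $C_l = \bar{C}_l$ and $\sigma = \bar{\sigma}$. $\sigma$ and $\alpha$ are long-range dependence and heavy-tail parameters, respectively. Theorem 6 also appears in [20, Theorem A].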

3. Marcinkiewicz Strong Law of Large Numbers for Partial Sums 

Our basic assumptions are MSLLNs for the random variables $\{A_k\}$ and $\{b_k\}$. (Technically, our assumptions are even more general as they allow the non-MSLLN case where $\chi - \gamma \le \frac{1}{2}$ that could be verified by some other method in some special situations.) The beauty of this MSLLN assumption is that: 1) It is minimal in the sense that the linear algorithm with $A_k = I$ and $\mu_k = \frac{1}{k}$ reduces to the partial sums $h_{k+1} - h = \frac{1}{k}\sum_{j=1}^{k}(b_j - b)$ (since $h = b$ when $A = I$) so a rate of convergence in the algorithm solution $h_k$ implies an MSLLN for the random variables $\{b_j\}$. 2) MSLLNs hold under very general conditions, including heavily-tailed and long-range dependent data. Hence, we review some of the literature in this area before giving simulation results for our theoretical work.
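
The minimality claim in 1) can be verified on a toy sequence: with $A_k = I$ and $\mu_k = \frac{1}{k}$ the algorithm forgets $h_1$ after one step and then tracks the running mean of the $b_j$. The scalar sketch below uses an assumed helper name (`lsa_identity_gain`) and made-up data.

```python
def lsa_identity_gain(b_seq, h_start):
    """Scalar algorithm h_{k+1} = h_k + (1/k) (b_k - h_k), i.e. A_k = 1, mu_k = 1/k."""
    h = h_start
    for k, b_k in enumerate(b_seq, start=1):
        h = h + (b_k - h) / k
    return h


b_seq = [3.0, -1.0, 4.0, 1.0, 5.0]       # arbitrary illustrative data
h_last = lsa_identity_gain(b_seq, h_start=100.0)
running_mean = sum(b_seq) / len(b_seq)
print(h_last, running_mean)  # both equal the sample mean, up to rounding
```

Subtracting $b$ from both sides gives exactly $h_{k+1} - h = \frac{1}{k}\sum_{j=1}^{k}(b_j - b)$, so any rate for $h_k$ is a rate for the normalized partial sums.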

The classical independent case, due to Marcinkiewicz, is generalized slightly 
by Rio [27]: 

Theorem 3 Let $\{X_i\}$ be an m-dependent, identically distributed sequence of zero-mean $\mathbb{R}$-valued random variables such that $E|X_1|^p < \infty$ for some $p \in (1, 2)$. Then,

$\frac{1}{n^{1/p}}\sum_{i=1}^{n} X_i \to 0$ a.s.

Actually, Rio gives a more general m-dependent result on page 922 of his work. However, the important observation for us is that only the $p^{\text{th}}$ moment need be finite rather than a higher moment as is typical under some stronger dependence assumptions. Theorem 3 is quite useful in verifying our conditions when $\{A_k\}$ and $\{b_k\}$ may have heavily-tailed distributions but are independent or m-dependent. For example, if $\chi - \gamma \in (\frac{1}{2}, 1)$ and the $\{A_k\}$ and $\{b_k\}$ are defined as in (3) in terms of i.i.d. $\{x_k\}$ and $\{y_k\}$ with $E|x_1|^{\frac{2}{\chi-\gamma}} < \infty$, $E|y_1|^{\frac{2}{\chi-\gamma}} < \infty$, then $\{A_k\}_{k \ge N}$ and $\{b_k\}_{k \ge N}$ are identically distributed, $N$-dependent and

$\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(A_k - A) \to 0$ and $\frac{1}{n^{\chi-\gamma}}\sum_{k=1}^{n}(b_k - b) \to 0$

a.s., where $A = EA_k$ and $b = Eb_k$, by applying Theorem 3 for each component. Hence, (8) holds.

There are many other important results that include heavy tails, long-range dependence or both. For example, Louhichi and Soulier [26] prove the following result for linear symmetric $\alpha$-stable (S$\alpha$S) processes.

Theorem 4 Let $\{\zeta_j\}_{j\in\mathbb{Z}}$ be an i.i.d. sequence of S$\alpha$S random variables with $1 < \alpha < 2$ and $\{c_j\}_{j\in\mathbb{Z}}$ be a bounded collection such that $\sum_{j\in\mathbb{Z}}|c_j|^s < \infty$ for some $s \in [1, \alpha)$. Set $X_k = \sum_{j\in\mathbb{Z}} c_{k-j}\zeta_j$. Then, for $p \in (1, 2)$ satisfying $\frac{1}{p} > 1 - \frac{1}{s} + \frac{1}{\alpha}$,

$\frac{1}{n^{1/p}}\sum_{i=1}^{n} X_i \to 0$ a.s.

The condition $s < \alpha$ ensures $\sum_{j\in\mathbb{Z}}|c_j|^{\alpha} < \infty$ and thereby convergence of $\sum_{j\in\mathbb{Z}} c_{k-j}\zeta_j$. Moreover, $\{X_k\}$ not only exhibits heavy tails but also long-range dependence if, for example, $c_j = |j|^{-\sigma}$ for $j \ne 0$ and some $\sigma \in (\frac{1}{2}, 1)$. Notice there are interactions between the heavy tail condition and the long range dependence condition. In particular for a given $p$, heavier tails ($\alpha$ becomes smaller) imply that you cannot have as long range dependence ($s$ becomes smaller) and vice versa. Moreover, this result is difficult to apply in the stochastic approximation setting. For example, if we wanted to apply it for $X_k = A_k$ in the scalar case, then we would need $x_k$ such that $x_k^2 = A_k$, which is impossible when $A_k$ is S$\alpha$S.

One nice feature of mixing assumptions is that they usually transfer from random variables to functions (like squares) of random variables. There are many mixing results that handle long range dependence. For example, Berbee [2] gives a nice $\beta$-mixing result. However, strong mixing is one of the most general types of mixing that is more easily verified in practice. Hence, we will just quote the following strong mixing result from Rio [27] (Theorem 1) in terms of the inverse $\alpha^{-1}(u) = \sup\{t \in \mathbb{R}_+ : \alpha_{\lfloor t \rfloor} > u\}$ of the strong mixing coefficients

$\alpha_n = \sup_{k\in\mathbb{Z}}\ \sup_{A \in \sigma(X_i,\, i \le k-n),\ B \in \sigma(X_k)} |P(AB) - P(A)P(B)|$

and the complementary quantile function

$Q_X(u) = \sup\{t \in \mathbb{R}_+ : P(|X| > t) > u\}$.

Theorem 5 Let $\{X_i\}$ be an identically-distributed, zero-mean sequence of $\mathbb{R}$-valued random variables such that $\int_0^1 [\alpha^{-1}(t/2)]^{p-1} Q_X^p(t)\,dt < \infty$ for some $p \in (1, 2)$. Then,

$\frac{1}{n^{1/p}}\sum_{i=1}^{n} X_i \to 0$ a.s.

Notice again that for a given $p$, heavier tails imply that you cannot have as long range dependence and vice versa: If you wanted to maintain the same value of the integral condition and there became more area under $P(|X| > t)$, then there would be more area under $Q_X^p(t)$ so the area under $[\alpha^{-1}(t/2)]^{p-1}$, which is equal to $2\sum_{n=0}^{\infty}\alpha_n^{p-1}$, would have to decrease to compensate. Also, there can be difficulty in establishing that a given model satisfies the strong mixing condition with the required decay of mixing coefficients. Still, this is an important result for verifying our basic assumptions.

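A new MSLLN for outer products of multivariate linear processes with long range dependence and heavy tails is studied in [20]. A new decoupling property is proved that shows the convergence rate is determined by the worst of the heavy tails or the long range dependence, but not the combination. This result is used to obtain the Marcinkiewicz Strong Law of Large Numbers for stochastic approximation (Theorem 2). The result is as follows.

Theorem 6 Let $\{\Xi_l\}$ and $\{\bar{\Xi}_l\}$ be i.i.d. zero mean random $\mathbb{R}^m$-vectors such that $\Xi_1 = (\xi_1^{(1)}, \ldots, \xi_1^{(m)})$, $\bar{\Xi}_1 = (\bar{\xi}_1^{(1)}, \ldots, \bar{\xi}_1^{(m)})$, $E[|\Xi_1|^2] < \infty$, $E[|\bar{\Xi}_1|^2] < \infty$ and $\max_{1 \le i,j \le m}\sup_{t \ge 0} t^{\alpha} P(|\xi_1^{(i)}\bar{\xi}_1^{(j)}| > t) < \infty$ for some $\alpha \in (1, 2)$. Moreover, suppose matrix sequences $(C_l)_{l\in\mathbb{Z}}, (\bar{C}_l)_{l\in\mathbb{Z}} \in \mathbb{R}^{(d+1)\times m}$ satisfy

$\sup_{l\in\mathbb{Z}} |l|^{\sigma}\|C_l\| < \infty$, $\sup_{l\in\mathbb{Z}} |l|^{\bar{\sigma}}\|\bar{C}_l\| < \infty$ for some $(\sigma, \bar{\sigma}) \in (\frac{1}{2}, 1]$,

$X_k$, $\bar{X}_k$ take the form of (4), $D_k = X_k\bar{X}_k^T$ and $D = E[X_1\bar{X}_1^T]$. Then, for $p$ satisfying $p < \frac{1}{2-\sigma-\bar{\sigma}} \wedge \alpha$,

$\lim_{n\to\infty}\frac{1}{n^{1/p}}\sum_{k=1}^{n}(D_k - D) = 0$ a.s.

In the stochastic approximation setting, this theorem actually shows the MSLLN for $D_k - E[D_k]$, where

$D_k = \begin{pmatrix} x_k x_k^T & y_{k+1}x_k \\ y_{k+1}x_k^T & y_{k+1}^2 \end{pmatrix}$,

and one then throws out the unneeded entries, keeping the blocks $A_k = x_k x_k^T$ and $b_k = y_{k+1}x_k$.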


4. Experimental Results 

In this section we verify our results of the previous sections experimentally in the stochastic approximation setting discussed in the introduction. In particular, we use power law or folded t distributions.

Power law distribution: A random variable $\xi$ obeys a power law with parameters $\beta > 1$ and $x_{\min} > 0$, written $\xi \sim PL(x_{\min}, \beta)$, if it has density

$f(x) = \frac{\beta - 1}{x_{\min}}\left(\frac{x}{x_{\min}}\right)^{-\beta}$ $\forall x > x_{\min}$.

Note that $E|\xi|^r = \frac{(\beta-1)\,x_{\min}^r}{\beta - 1 - r}$ if $r < \beta - 1$ and $E|\xi|^r = \infty$ if $r \ge \beta - 1$.

Folded t distribution: A non-negative random variable $\xi$ has a folded t distribution with parameter $\beta > 1$, written $\xi \sim Ft(\beta)$, if it has density

$f(x) = \frac{2\,\Gamma(\frac{\beta}{2})}{\Gamma(\frac{\beta-1}{2})\sqrt{(\beta-1)\pi}}\left(1 + \frac{x^2}{\beta-1}\right)^{-\frac{\beta}{2}}$ $\forall x > 0$.

Note that $E(|\xi|^r)$ exists if and only if $r < \beta - 1$.
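
Power-law variates are easy to generate by inverting the distribution function. Integrating the $PL(x_{\min}, \beta)$ density gives $F(x) = 1 - (x/x_{\min})^{-(\beta-1)}$ for $x \ge x_{\min}$, which the sketch below inverts. The helper names are hypothetical, and this is one standard way to sample, not necessarily the one used for the experiments reported here.

```python
import random


def pl_quantile(u, x_min, beta):
    """Inverse of F(x) = 1 - (x / x_min)^(-(beta - 1)), for u in [0, 1)."""
    return x_min * (1.0 - u) ** (-1.0 / (beta - 1.0))


def pl_sample(rng, x_min, beta):
    """Draw one PL(x_min, beta) variate by inverse-transform sampling."""
    return pl_quantile(rng.random(), x_min, beta)


def pl_moment(r, x_min, beta):
    """E[xi^r] for xi ~ PL(x_min, beta); infinite when r >= beta - 1."""
    if r >= beta - 1.0:
        return float("inf")
    return (beta - 1.0) * x_min ** r / (beta - 1.0 - r)


rng = random.Random(1)  # illustrative seed (assumption)
raw = pl_sample(rng, x_min=0.01, beta=4.0)
centred = raw - pl_moment(1, x_min=0.01, beta=4.0)  # zero-mean noise
print(raw, centred)
```

The last two lines mirror the centring of heavy-tailed noise by its mean, which requires $\beta > 2$ so that the first moment is finite.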

Experimental results in this section are divided in two parts. 


4.1. Heavy-tailed cases


Assume $N = 1$ in (3), dimension is $d = 2$ and $\{(x_k^{(1)}, x_k^{(2)}, \epsilon_k)^T, k = 1, 2, \ldots\}$ are i.i.d. random vectors so linear algorithm (2) reduces to:

$h_{k+1} = h_k + \mu_k(x_k y_{k+1} - x_k x_k^T h_k) = h_k + \mu_k(x_k x_k^T h + x_k\epsilon_k - x_k x_k^T h_k)$. (14)


For consistency and performance, we always let $x_k^{(1)}$, $x_k^{(2)}$ and $\epsilon_k$ be independent. The runs are always initialized with $h_1 = (101, 101)^T$ and, for testing purposes, the optimal $h = (1, 1)^T$ is known.


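Example 1 Let $x_k^{(1)}, x_k^{(2)} \sim PL(x_{\min} = 1, \beta)$ and $\epsilon_k = \epsilon'_k - E(\epsilon'_k)$ with $\epsilon'_k \sim PL(x'_{\min} = 0.01, \beta)$. The normalized errors in 100 trial simulations, $\left\{\frac{|h_n^j - h|}{|h_1 - h|}\right\}_{j=1}^{100}$, are averaged,

$r_n = \frac{1}{100}\sum_{j=1}^{100}\frac{|h_n^j - h|}{|h_1 - h|}$,

and given in Table 1 in terms of gain parameter $\chi$, distributional parameter $\beta$ and sample size $n$.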

The Marcinkiewicz thresholds, $M = \frac{2}{\beta - 1}$, corresponding to $\beta = 3.5$, $\beta = 4$ and $\beta = 4.5$ are respectively $M = 0.8$, $0.67$ and $0.57$. Our theoretical results prove convergence above this threshold. While the results in Table 1 are obviously still influenced by (heavy-tailed) randomness, one can see that convergence does appear to be taking place as one moves from $n = 100{,}000$ through $n = 750{,}000$ to $n = 1{,}500{,}000$ when $\chi > M$ and it is less clear that convergence is taking place when $\chi < M$. Furthermore, our (as well as prior) theoretical results predict rates of convergence that increase in $\chi$. Indeed, in the case $\beta = 4$ our theoretical results suggest that $\chi \approx 1$ should result in a rate of convergence $|h_n - h| = o(n^{-0.33})$ while $\chi = 0.85$ should only result in a rate of convergence $|h_n - h| = o(n^{-0.18})$. Conversely, Table 1 demonstrates that $\chi = 0.85$ performs better, which seems to contradict the theory. However, this paradox is explained by the exploding constants discussion of the introduction and, in fact, points out that more refined theory, involving functional results, is needed. The proper way to use our theoretical results then is to predict the best $\chi$ (lowest value of $r_n$) in the range of $(M, 1]$, i.e. in $(0.8, 1]$, $(0.67, 1]$ and $(0.57, 1]$, respectively for our three $\beta$'s.


Table 1

Algorithm performance - Power Law

                          n = 100,000                n = 750,000                n = 1,500,000
$\chi \backslash \beta$   3.5      4        4.5      3.5      4        4.5      3.5      4        4.5
0.55                      0.1043   0.0391   0.0214   0.0841   0.0315   0.0155   0.0625   0.0254   0.0134
0.6                       0.0864   0.0314   0.0169   0.0707   0.0243   0.0115   0.0548   0.0203   0.0099
0.65                      0.0690   0.0247   0.0129   0.0578   0.0192   0.0086   0.0469   0.0166   0.0075
0.7                       0.0525   0.0190   0.0098   0.0487   0.0159   0.0067   0.0457   0.0141   0.0056
0.75                      0.0397   0.0151   0.0082   0.0449   0.0137   0.0051   0.0456   0.0114   0.0042
0.8                       0.0326   0.0136   0.0105   0.0448   0.0111   0.0038   0.0402   0.0087   0.0031
0.85                      0.0314   0.0168   0.0549   0.0398   0.0085   0.0082   0.0324   0.0070   0.0035
0.9                       0.0344   0.0719   0.2445   0.0438   0.0118   0.0764   0.0272   0.0079   0.0341
0.95                      0.0902   0.3047   0.6631   0.3739   0.0897   0.3068   0.0248   0.0519   0.1963
0.98                      0.2226   0.5733   1.0154   0.9219   0.2302   0.5251   0.0374   0.1488   0.3930
1                         0.3876   0.8062   0.6631   1.1891   0.3745   0.6925   0.0662   0.2596   0.5644


Table 2

Best fixed $\chi$ - Power Law

                      n = 100,000           n = 750,000           n = 1,500,000
$\beta$               3.5    4      4.5     3.5    4      4.5     3.5    4      4.5
Best $\chi$           0.85   0.8    0.75    0.85   0.85   0.8     0.95   0.85   0.8
Resulting $\gamma$    0.05   0.13   0.18    0.05   0.18   0.23    0.15   0.18   0.23


The best $\chi$'s, corresponding to the smallest value of $r_n$ for $\beta = 3.5$, $\beta = 4$ and $\beta = 4.5$ and 3 different sample sizes, as well as the $\gamma$ corresponding to the theoretical rate of convergence $o(n^{-\gamma})$ are summarized in Table 2. In all cases the best value for $\chi$ is in the predicted range. As we explained, a faster decreasing gain is appropriate for a heavier-tailed distribution, which is also confirmed by Table 2. Notice also that the best $\chi$ increases in $n$, a phenomenon consistent with our exploding constants and the initial condition effect discussion.
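
The way Table 2 is read off Table 1 can be made explicit. The sketch below copies only the $n = 100{,}000$ columns of Table 1 and the rounded thresholds $0.8$, $0.67$ and $0.57$ quoted in Example 1; the dictionary layout and function names are assumptions made for illustration.

```python
# Rounded Marcinkiewicz thresholds quoted in Example 1, keyed by beta.
M_QUOTED = {3.5: 0.8, 4.0: 0.67, 4.5: 0.57}

# Averaged normalized errors from Table 1, n = 100,000: chi -> {beta: error}.
TABLE1_N100K = {
    0.55: {3.5: 0.1043, 4.0: 0.0391, 4.5: 0.0214},
    0.60: {3.5: 0.0864, 4.0: 0.0314, 4.5: 0.0169},
    0.65: {3.5: 0.0690, 4.0: 0.0247, 4.5: 0.0129},
    0.70: {3.5: 0.0525, 4.0: 0.0190, 4.5: 0.0098},
    0.75: {3.5: 0.0397, 4.0: 0.0151, 4.5: 0.0082},
    0.80: {3.5: 0.0326, 4.0: 0.0136, 4.5: 0.0105},
    0.85: {3.5: 0.0314, 4.0: 0.0168, 4.5: 0.0549},
    0.90: {3.5: 0.0344, 4.0: 0.0719, 4.5: 0.2445},
    0.95: {3.5: 0.0902, 4.0: 0.3047, 4.5: 0.6631},
    0.98: {3.5: 0.2226, 4.0: 0.5733, 4.5: 1.0154},
    1.00: {3.5: 0.3876, 4.0: 0.8062, 4.5: 0.6631},
}


def best_chi(table, beta):
    """Gain exponent chi with the smallest averaged normalized error."""
    return min(table, key=lambda chi: table[chi][beta])


def resulting_gamma(chi, beta):
    """Theoretical rate exponent chi - M, rounded as in Table 2."""
    return round(chi - M_QUOTED[beta], 2)


for beta in (3.5, 4.0, 4.5):
    chi = best_chi(TABLE1_N100K, beta)
    print(beta, chi, resulting_gamma(chi, beta), chi > M_QUOTED[beta])
```

Each printed row reproduces one entry of the first block of Table 2 and confirms that the empirical minimizer lies in $(M, 1]$.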

Now, we repeat the previous example with a different distribution. Since the 
results are consistent with those of the previous example, we will keep our 
discussion to a minimum. 

Example 2 Let $x_k^{(1)}, x_k^{(2)} \sim Ft(\beta)$ and $\epsilon_k = \epsilon'_k - E(\epsilon'_k)$ with $\epsilon'_k \sim Ft(\beta)$. The simulation results for three $\beta$'s: 3.5, 4 and 4.5 with corresponding Marcinkiewicz thresholds, $M = \frac{2}{\beta - 1}$, 0.8, 0.67 and 0.57 are given in Table 3 with sample sizes: $n = 50{,}000$, $100{,}000$ and $750{,}000$.

A summary of the best $\chi$ results is given in Table 4. Again, a smaller $\beta$ corresponds to heavier tails and larger best $\chi$. Moreover, as we predicted, the best $\chi$ for $\beta = 3.5$, $\beta = 4$ and $\beta = 4.5$ are in the ranges $(0.8, 1]$, $(0.67, 1]$ and $(0.57, 1]$, respectively. Best $\chi$'s increase in sample size.

Table 3

Algorithm performance - Folded t

                          n = 50,000                 n = 100,000                n = 750,000
$\chi \backslash \beta$   3.5      4        4.5      3.5      4        4.5      3.5      4        4.5
0.55                      0.1040   0.0422   0.0274   0.1003   0.0429   0.0221   0.0780   0.0246   0.0145
0.6                       0.0958   0.0345   0.0221   0.0929   0.0336   0.0177   0.0590   0.0195   0.0104
0.65                      0.0851   0.0291   0.0177   0.0778   0.0269   0.0141   0.0420   0.0149   0.0081
0.7                       0.0697   0.0245   0.0138   0.0661   0.0216   0.0112   0.0318   0.0120   0.0064
0.75                      0.0599   0.0204   0.0113   0.0556   0.0173   0.0089   0.0336   0.0099   0.0050
0.8                       0.0505   0.0172   0.0103   0.0439   0.0140   0.0075   0.0374   0.0076   0.0038
0.85                      0.0399   0.0145   0.0098   0.0341   0.0118   0.0063   0.0339   0.0058   0.0029
0.9                       0.0312   0.0133   0.0087   0.0278   0.0100   0.0057   0.0265   0.0048   0.0024
0.95                      0.0275   0.0241   0.0097   0.0245   0.0089   0.0060   0.0205   0.0039   0.0021
0.98                      0.0347   0.0475   0.0212   0.0274   0.0117   0.0121   0.0179   0.00371  0.0032
0.99                      0.0404   0.0583   0.0295   0.0310   0.0149   0.0172   0.0173   0.00373  0.0048
1                         0.0486   0.0700   0.0413   0.0369   0.0205   0.0249   0.0170   0.0039   0.0077


Table 4

Best fixed $\chi$ - Folded t

                 n = 50,000            n = 100,000           n = 750,000
$\beta$          3.5    4      4.5     3.5    4      4.5     3.5    4      4.5
Best $\chi$      0.95   0.9    0.9     0.95   0.95   0.9     1      0.98   0.95
$\gamma <$       0.15   0.23   0.33    0.15   0.28   0.33    0.2    0.31   0.38


4.2. Combined Heavy-tailed and Long Range dependence case

If we take $N = 1$ and dimension $d = 1$, we have $(x_k, y_{k+1})^T = \sum_{j=-\infty}^{\infty} C_{k-j}\Xi_j$ in which $C_j = (c_j, c_j)^T$ and $\Xi_j = (\xi_j^{(1)}, \xi_j^{(2)})^T$, $j \in \mathbb{Z}$, are i.i.d. Hence, $x_k = \sum_{j\in\mathbb{Z}} c_{k-j}\xi_j^{(1)}$ and $y_{k+1} = \sum_{j\in\mathbb{Z}} c_{k-j}\xi_j^{(2)}$, where $\xi_j^{(2)} = h\xi_j^{(1)} + a_j$ and the $\{a_j\}$ are i.i.d. zero mean random variables. This relation between $\xi_j^{(1)}$ and $\xi_j^{(2)}$ is due to the fact that $y_{k+1} = x_k h + \epsilon_k$ and $\epsilon_k = \sum_{j\in\mathbb{Z}} c_{k-j}a_j$. We consider $c_j = |j|^{-\sigma}$ for $j \ne 0$ and $\sigma \in (\frac{1}{2}, 1]$, with $c_0 = 1$. The linear algorithm (2) reduces to:

$h_{k+1} = h_k + \mu_k(x_k y_{k+1} - x_k^2 h_k) = h_k + \mu_k(x_k^2 h + x_k\epsilon_k - x_k^2 h_k)$. (15)

The initial and optimal values are $h_1 = 401$ and $h = 1$.

Example 3 Let $\xi_j^{(1)} \sim PL(x_{\min} = 0.01, \beta)$ and $a_j = f_j - E(f_j)$ with $f_j \sim PL(x'_{\min} = 0.01, \beta)$. The simulation is done for a one-sided process and, since a computer cannot evaluate an infinite sum, we truncate the summation to the range $(0, 500{,}000)$. As in the last two examples the normalized errors in 100 trial simulations, $\left\{\frac{|h_n^j - h|}{|h_1 - h|}\right\}_{j=1}^{100}$, are averaged and results for different $\chi$'s, $\beta$'s and sample sizes $n$ are presented in the following tables. The assumed $\sigma$ is 0.65. The Marcinkiewicz threshold, $M = \frac{1}{\alpha} \vee (2 - 2\sigma)$, corresponding to $\beta = 4$, $\beta = 4.5$ and $\beta = 5$ is 0.7. Hence, predicted ranges for $\chi$'s with smallest $r_n$ will be $(0.7, 1]$. Simulation results are provided in Table 5 with a summary of the best $\chi$ in Table 6. It is worth noticing that the convergence does not seem to take place below the Marcinkiewicz threshold, the best $\chi$'s are in the predicted ranges and the normalized error decreases as $\beta$ increases.
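
A one-sided, truncated linear process of the kind described in Example 3 can be generated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the sum at time $k$ runs only over the innovations generated so far, and the direct double loop costs $O(T^2)$ operations, so an FFT-based convolution would be needed at the sample sizes used here.

```python
def lrd_coefficient(j, sigma):
    """c_0 = 1 and c_j = |j|^(-sigma) for j != 0."""
    return 1.0 if j == 0 else abs(j) ** (-sigma)


def one_sided_linear_process(innovations, sigma):
    """x_k = sum_{j=0}^{k} c_{k-j} * xi_j for k = 0, ..., T-1 (truncated sum)."""
    n_terms = len(innovations)
    coeffs = [lrd_coefficient(j, sigma) for j in range(n_terms)]
    x = []
    for k in range(n_terms):
        x.append(sum(coeffs[k - j] * innovations[j] for j in range(k + 1)))
    return x


# A unit impulse recovers the coefficient sequence itself (here sigma = 1).
print(one_sided_linear_process([1.0, 0.0, 0.0, 0.0, 0.0], sigma=1.0))
```

Feeding in centred power-law innovations instead of the impulse gives a series that is both heavy-tailed and, for $\sigma$ near $\frac{1}{2}$, strongly long-range dependent.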


Table 5

Algorithm performance for LRD-HT cases with $\sigma = 0.65$

                          n = 100                          n = 5000                         n = 10,000
$\chi \backslash \beta$   4          4.5        5          4          4.5        5          4          4.5        5
0.55                      0.024582   0.015503   0.012051   0.029090   0.019016   0.015093   0.027746   0.018037   0.014275
0.6                       0.010917   0.006166   0.004508   0.013172   0.007826   0.005897   0.012465   0.007359   0.005527
0.7                       0.000665   0.000237   0.000132   0.000958   0.000414   0.000262   0.000881   0.000377   0.000238
0.75                      2.98e-05   7.88e-06   6.49e-06   9.77e-05   3.15e-05   1.70e-05   8.79e-05   2.83e-05   1.52e-05
0.8                       1.02e-05   7.76e-06   6.39e-06   5.19e-06   3.80e-06   3.11e-06   4.72e-06   3.30e-06   2.69e-06
0.85                      9.91e-06   7.77e-06   6.41e-06   5.01e-06   3.91e-06   3.21e-06   4.38e-06   3.37e-06   2.76e-06
0.9                       9.93e-06   7.79e-06   6.45e-06   5.20e-06   4.12e-06   3.39e-06   4.54e-06   3.50e-06   2.86e-06
0.95                      1.00e-05   7.90e-06   6.55e-06   5.63e-06   4.42e-06   3.62e-06   4.73e-06   3.65e-06   2.98e-06
0.98                      1.01e-05   7.99e-06   6.61e-06   5.97e-06   4.69e-06   3.86e-06   4.88e-06   3.77e-06   3.08e-06
1                         1.02e-05   8.04e-06   6.65e-06   6.28e-06   4.91e-06   4.03e-06   5.02e-06   3.89e-06   3.19e-06

Note that with $\sigma = 0.65$, the term $2 - 2\sigma$ is the determining one for all
$\beta = 4, 4.5$ and $5$; hence we do not expect much change in $\chi$ as $\beta$
changes. In addition, the rate of convergence for all considered $\beta$'s is determined
by $\gamma < \chi + 2\sigma - 2$.
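
As a quick arithmetic check of this bound, the following minimal Python sketch evaluates $2 - 2\sigma$ and $\chi + 2\sigma - 2$ for $\sigma = 0.65$ and for the gains 0.8 and 0.85 that appear as best $\chi$ in Table 6. The function names are illustrative only and are not taken from the paper.

```python
def lrd_threshold(sigma):
    """Long-range-dependence term 2 - 2*sigma of the threshold."""
    return 2.0 - 2.0 * sigma


def rate_bound(chi, sigma):
    """Upper bound chi + 2*sigma - 2 on the rate exponent gamma."""
    return chi + 2.0 * sigma - 2.0


sigma = 0.65
print("threshold term:", round(lrd_threshold(sigma), 2))
for chi in (0.8, 0.85):
    print("chi =", chi, "-> gamma <", round(rate_bound(chi, sigma), 2))
```

Under these assumptions the bound is positive only for $\chi$ above 0.7, which matches the predicted range $(0.7, 1]$.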


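Table 6

Best fixed $\chi$: Power Law, LRD with $\sigma = 0.65$

                          |      n = 100       |      n = 5000      |     n = 10000
$\beta$                   |  4     4.5    5    |  4     4.5    5    |  4     4.5    5
--------------------------+--------------------+--------------------+-------------------
Best $\chi$               | 0.85   0.8   0.8   | 0.85   0.8   0.8   | 0.85   0.8   0.8
Resulting bound on $\gamma$ | 0.15   0.1   0.1   | 0.15   0.1   0.1   | 0.15   0.1   0.1
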


5. The proof of Theorem 1 


Part a) Step 1: Reduce rate of convergence to convergence of a transformed 
algorithm. 

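Letting $\eta_k = \left(\frac{k+1}{k}\right)^\gamma - 1$, setting $g_k = k^\gamma (h_k - h)$ and using (9), one finds that

$$g_{k+1} = g_k + \frac{1}{k^\chi}\left(\bar b_k - \bar A_k g_k\right) + \eta_k g_k, \qquad (16)$$

where

$$\bar b_k = (k+1)^\gamma (b_k - A_k h) \quad \text{and} \quad \bar A_k = \left(\frac{k+1}{k}\right)^\gamma A_k. \qquad (17)$$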
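
The change of variables in Step 1 is purely algebraic, so it can be checked numerically. The sketch below is a scalar illustration under stated assumptions: the original recursion is taken to be $h_{k+1} = h_k + k^{-\chi}(b_k - A_k h_k)$, the transformed coefficients are $(k+1)^\gamma (b_k - A_k h)$ and $((k+1)/k)^\gamma A_k$, and the alternating sequences `A(k)`, `b(k)` are hypothetical test data, not the paper's.

```python
def check_transformation(n_steps=200, chi=0.8, gamma=0.3, h=0.7):
    """Return the largest gap between g_k and k**gamma * (h_k - h)."""
    A = lambda k: 1.0 + 0.5 * (-1) ** k   # hypothetical positive scalars A_k
    b = lambda k: 2.0 + 0.3 * (-1) ** k   # hypothetical scalars b_k
    h_k = 0.0
    g_k = 1 ** gamma * (h_k - h)          # g_1 = 1^gamma (h_1 - h)
    max_gap = 0.0
    for k in range(1, n_steps):
        eta_k = ((k + 1) / k) ** gamma - 1.0
        b_bar = (k + 1) ** gamma * (b(k) - A(k) * h)
        A_bar = ((k + 1) / k) ** gamma * A(k)
        # transformed recursion for g
        g_next = g_k + (b_bar - A_bar * g_k) / k ** chi + eta_k * g_k
        # original recursion for h
        h_next = h_k + (b(k) - A(k) * h_k) / k ** chi
        h_k, g_k = h_next, g_next
        max_gap = max(max_gap, abs(g_k - (k + 1) ** gamma * (h_k - h)))
    return max_gap


print(check_transformation())
```

The identity holds for any reference value `h`; the proof uses it with $h = A^{-1}b$.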
However, we have by Taylor’s theorem and assumption that


— < ^E fc asn^oo. 


k= 1 


n x 


fe=i 


(18)

Step 2: Show MSLLN for new coefficients, i.e. $\frac{1}{n^\chi}\sum_{k=1}^{n} (\bar A_k - A) \to 0$ and $\frac{1}{n^\chi}\sum_{k=1}^{n} \bar b_k \to 0$ as $n \to \infty$.


1 n 
-V 


k + 1 


k =l 
n k 




n* 


-yy 


fc=2 1=2 


J + l 


fc=l 

71 


^E 

1=2 

1 

< - 

n x 


+E 

1=2 


J- 1 


EO 4 *-- 4 ) 


k—2 


L Vi — 1 


j + i 


2 7 - 


1 + 1 


1-1 

sr( 

n + l 7 7 


(A fc - +) 


E l- 4 * - 4 ) 

k—2 


E - 4 ) 

k—2 


71 


1-2 


ra / (j - 2)> 


l-i 




fc =2 


which goes to zero by assumption and the Toeplitz lemma. By Taylor’s theorem 


< 


< 


n x 

1 

n x 

1 

n x 


n n 

— + !) 7 (bk - A k h) - n ^ n + 1 j_ 7 E (^= _ 


fc=l 

n—1 n 


k =1 


E E [j 7 -(i + i) 7 ](& fc -^) 

/c=l J = fc+1 

j -1 


E ^ 71 


1=2 


E (6 fc - A k h) 


-Ei x_1 


k= 1 

1 


i=2 


(j -i)> 


i-i 


E ~ Akh ) 


*=i 


(19) 


which goes to zero by the Toeplitz lemma. 

Step 3: Convergence of $g_k$, and hence the rate of convergence of $h_k$, follows from
Proposition 1 with $b = 0$, $h_k = g_k$, $h = 0$ and $\eta_k = \left(\frac{k+1}{k}\right)^\gamma - 1$. □

Proposition 1 Suppose $\{A_k\}_{k=1}^{\infty}$ is a symmetric, positive-semidefinite $\mathbb{R}^{d \times d}$-valued
sequence; $A$ is a (symmetric) positive-definite matrix; $\chi \in (0,1)$; $\theta \in (\chi, 1]$;
$\eta_k \le \frac{\eta}{k^\theta}$; $\eta > 0$ and

$$h_{k+1} = h_k + \frac{1}{k^\chi}(b_k - A_k h_k) + \eta_k h_k \quad \text{for all } k = 1, 2, \ldots; \qquad (20)$$

$$\frac{1}{n^\chi}\sum_{k=1}^{n} (b_k - b) \to 0 \quad \text{and} \quad \frac{1}{n^\chi}\sum_{k=1}^{n} (A_k - A) \to 0. \qquad (21)$$

Then, $h_n \to h = A^{-1} b$ as $n \to \infty$.
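
To make the statement concrete, here is a minimal sketch of recursion (20) in dimension 2, written under simplifying assumptions that are not from the paper: $\eta_k = 0$, and deterministic alternating perturbations of $A$ and $b$, so that the partial sums in (21) stay bounded and the averaging conditions hold trivially. The matrices, the perturbation sizes and the name `run_recursion` are all illustrative.

```python
def run_recursion(n_steps=20000, chi=0.7):
    """Iterate h_{k+1} = h_k + k^{-chi} (b_k - A_k h_k) with eta_k = 0."""
    A = [[2.0, 0.0], [0.0, 1.0]]   # positive-definite limit matrix (assumed)
    b = [1.0, 1.0]                 # limit vector (assumed); A^{-1} b = [0.5, 1.0]
    h = [0.0, 0.0]
    for k in range(1, n_steps + 1):
        s = 1.0 if k % 2 == 0 else -1.0
        # A_k = A + 0.5*s*I stays symmetric positive definite
        A_k = [[A[0][0] + 0.5 * s, A[0][1]],
               [A[1][0], A[1][1] + 0.5 * s]]
        b_k = [b[0] + 0.3 * s, b[1] - 0.3 * s]
        gain = k ** (-chi)
        r0 = b_k[0] - (A_k[0][0] * h[0] + A_k[0][1] * h[1])
        r1 = b_k[1] - (A_k[1][0] * h[0] + A_k[1][1] * h[1])
        h = [h[0] + gain * r0, h[1] + gain * r1]
    return h


h_final = run_recursion()
print(h_final)  # should be close to A^{-1} b = [0.5, 1.0]
```

The iterate never sees $A$ or $b$ directly, only the perturbed $A_k$ and $b_k$; the decreasing gain averages the perturbations out.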

Notation: To ease the notation in the sequel, we will take the product over no
factors to be 1 and the sum of no terms to be 0. For convenience, we let:

$$v_k := h_k - h, \quad Y_k := A_k - A, \quad z_k := b_k - A_k h. \qquad (22)$$


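Proof. Step 1: Show simplified algorithm with the $A_k$'s replaced by $A$ converges.

We note $\frac{1}{n^\chi}\sum_{k=1}^{n} z_k \to 0$ and will show $v_k \to 0$, by proving $u_k \to 0$ and $w_k := v_k - u_k \to 0$, where

$$u_{k+1} = \left(I - \frac{A}{k^\chi} + \eta_k I\right) u_k + \frac{z_k}{k^\chi} + \eta_k h \quad \text{subject to } u_1 = v_1. \qquad (23)$$

By induction, we have:

$$u_n = \prod_{l=1}^{n-1}\left(I - \frac{A}{l^\chi} + \eta_l I\right) u_1 + \sum_{j=1}^{n-1} F_{j,n} z_j + \sum_{j=1}^{n-1} \bar F_{j,n} h \quad \text{for } n = 1, 2, \ldots \qquad (24)$$

where

$$F_{j,n} = j^{-\chi} \prod_{l=j+1}^{n-1}\left(I - \frac{A}{l^\chi} + \eta_l I\right), \qquad \bar F_{j,n} = \eta_j j^\chi F_{j,n} \quad \text{for } j = 1, 2, \ldots, n-1,\ n = 2, 3, \ldots \qquad (25)$$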
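
The unrolled form of the simplified recursion can be sanity-checked in the scalar case. The sketch below assumes the recursion $u_{k+1} = (1 - A/k^\chi + \eta_k) u_k + z_k/k^\chi + \eta_k h$ with $\eta_k = ((k+1)/k)^\gamma - 1$, and the weights $F_{j,n} = j^{-\chi}\prod_{l=j+1}^{n-1}(1 - A/l^\chi + \eta_l)$ and $\bar F_{j,n} = \eta_j j^\chi F_{j,n}$; the bar notation, the test sequence `z` and all parameter values are assumptions made for illustration.

```python
def eta(k, gamma):
    return ((k + 1) / k) ** gamma - 1.0


def recursion(n, A, h, u1, z, chi, gamma):
    """Run the scalar recursion from u_1 up to u_n."""
    u = u1
    for k in range(1, n):
        u = (1 - A / k ** chi + eta(k, gamma)) * u + z(k) / k ** chi + eta(k, gamma) * h
    return u


def closed_form(n, A, h, u1, z, chi, gamma):
    """Evaluate the unrolled sum with weights F and F_bar."""
    def prod(lo, hi):  # product over l = lo..hi; empty product is 1
        p = 1.0
        for l in range(lo, hi + 1):
            p *= 1 - A / l ** chi + eta(l, gamma)
        return p

    total = prod(1, n - 1) * u1
    for j in range(1, n):
        F = j ** (-chi) * prod(j + 1, n - 1)
        F_bar = eta(j, gamma) * j ** chi * F
        total += F * z(j) + F_bar * h
    return total


z = lambda k: 0.3 * (-1) ** k  # hypothetical forcing sequence
print(recursion(30, 1.5, 0.7, 0.4, z, 0.6, 0.2))
print(closed_form(30, 1.5, 0.7, 0.4, z, 0.6, 0.2))
```

Both functions should agree up to floating-point rounding. In the matrix case the same unrolling applies, since all factors are functions of the single matrix $A$ and therefore commute.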


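Hence, by (24), (25) and Lemma 2 i), ii),

$$\lim_{n\to\infty} |u_n| \le \lim_{n\to\infty} \left\| \prod_{l=1}^{n-1}\left(I - \frac{A}{l^\chi} + \eta_l I\right) \right\| |u_1| + \lim_{n\to\infty} \left| \sum_{j=1}^{n-1} F_{j,n} z_j \right| + \lim_{n\to\infty} \left| \sum_{j=1}^{n-1} \bar F_{j,n} h \right| = 0.$$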


Step 2: Transfer stability from $A$ to blocks of $A_k$.

Define the blocks

$$n_k = \lfloor (ak)^{\frac{1}{1-\chi}} \rfloor := \max\{i \in \mathbb{N}_0 : i \le (ak)^{\frac{1}{1-\chi}}\}, \qquad (26)$$

$$I_k = \{n_k, n_k + 1, \ldots, n_{k+1} - 1\} \qquad (27)$$










M.A. Kouritzin and S. Sadeghi/Convergence Rates and Decoupling 




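for $k = 0, 1, 2, \ldots$ and the block products

$$U_k = \prod_{l \in I_k}\left(I - \frac{A_l}{l^\chi} + \eta_l I\right) \quad \text{and} \quad V_{j,k} = \prod_{l=j+1}^{n_{k+1}-1}\left(I - \frac{A_l}{l^\chi} + \eta_l I\right)\frac{1}{j^\chi} Y_j. \qquad (28)$$

For the $U_k$'s we have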


i&ik 


n if--,/ 

le/* ie/ fc 1 

ll>l2 


(K 


Ai 2 

% 


- Vh 1 


: E 

Zl ,Zo ,Zq < 

Z 1 > l 2> l 3 


_ l /X 




^ )\j£- vi 3 i j+• • • (-i) fc n(#-^ 


so 


II £4II < 


J -E# 


l€i k 


E^ + 

ie/fc 


JUt-'Hf--', 

l l> l 2 


E 


+ 


*1 >^2> l 3^ I k 
l l> l 2 > l 3 


+n 

lei k 


T-'' 1 f-’'-' I-"'' 


Y~ — fill 
lx 11 


(29) 


/ \ / 

However, we know that V ;yi > 72 >...> (( . a h a j2 ■ ■ ■ a ]k < A (E y Uj) 
so, it follows that 

E 


for dj > 0 


Zl,Z 2 eIfc 

h> l 2 


A h T 

11 Vhl 


Al 2 J 

lX ^ 2 ^ 


+ E 


1 1> 1 2> 1 3 ^ 7 fc 
l l > l 2 > l 3 


A h 

11 


- mJ 


A h 

ll 


- Vh 1 


A h T 

IX Vhl 

l 3 


■n 

i&ik 


A L 
lx 


- till 


^fc +1 n k 

* E 

m =2 


E 

vl£lk 


f WM 

V lx 


+ Vi 


As a result, we find by (29) that 


J -^E 

l&Ik 


1 

E 


E 

i&ik 


E 

lx 


+ E 7 * 

leik 


\\U k \\ < 
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nk+i—nk 

E 


E (Hf + m) 


Jei k 


m\ 


(30) 


Now, let $\lambda_{\min}$ and $\lambda_{\max}$ be the smallest and biggest eigenvalues of $A$ and define
$a' = \frac{a}{1-\chi}$, where $a > 0$ is chosen small enough that


a' < 


1 


An 


. A m m + ||A|| ’ d||A|| ’ e 1 (d||A ||) 2 
Then, by (27) and the fact that 

- n l ~ x ) < E/i^ A;(K+i - C~ x - ( n * - 1 ) 1 ^ x ) 

A iGlk X 

we have limfc _>. 00 (X)/g/ fc — a ') is i n the range of 

( li m n fc+? ~ n l~ X ~ a lim ~ n l~ X ~ a , ril~ x - (n fc - l) 1 ”* ^ 


(31) 


V k—>oo 1 — y 

so by Taylor’s theorem 


lim 

fc—»• oo 


Ei)-*' 


JeJ* 


i* 


/c—>■ oo 


< lim 


1 - x 


i-x 


1 


k—>oo I 1 — 

= o, 


1—X 1—x 

«fc+f- n fe a 


(n fc - 1 )* 


which also implies 




k—>oo ' 


k—>oo 


lx 


(32) 


(33) 


ie/ fc 

For arbitrary e > 0 one finds some iX e > 0 by (32) and (31) such that 


J -^E 




= max iwi E lx ^ 


IX 1 ’ 1 Amin E lx 

k ie/fc ie/fc 

< 1 — A mind 1 + e for all k> K e 


(34) 


Moreover, we can use Lemma 3 of Appendix, (22), (21), (32), (33), Taylor’s 
theorem and the fact d||A||a' < 1 and to obtain a K' e > K t such that 


nk+i~nk 

E 

m—2 


E 

k l£lk 


(<#+.» 


< _ l 3h _ l 3h _ l&Ik 

• m ! 


™fc+l 


m—2 


< e 


l+3e 


(d||A||a' + 3e ) 2 


2 


for all k> K’ e (35) 
Therefore, by (34), Lemma 2 iii), (30) and (35) one finds


l|£4ll< 


7-AVl 

ix 


leik 


Y 7 > i 

leik 


y- 

lx 


i&ik 


y 

\ieik 


nk+i—nk 


Mill 

ix 


VI 


m —2 


ml 


< 1 — A m ^ 7 ^a , + 3e + e 


l+3e 


(d\\A\\a'+ 3e) 2 


V k > K’ 


(36) 


Furthermore, using the fact that a! < and making for e > 0 small 

enough, we find from (36) that there exist a $0 < \gamma < 1$ and an integer $k_1 > 0$
such that

$$\|U_k\| \le \gamma \quad \text{for all } k \ge k_1. \qquad (37)$$
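
The role of the blocks in Step 2 is that the total gain collected over one block is asymptotically constant. The sketch below checks this numerically under the assumptions that the block starts are $n_k = \lfloor (ak)^{1/(1-\chi)} \rfloor$ and that the limiting block gain is $a' = a/(1-\chi)$; the helper names and the parameter values are illustrative.

```python
import math


def block_start(k, a, chi):
    """n_k = floor((a*k)^(1/(1-chi))), the assumed first index of block I_k."""
    return math.floor((a * k) ** (1.0 / (1.0 - chi)))


def block_gain_sum(k, a, chi):
    """Sum of the gains l^(-chi) over l in I_k = {n_k, ..., n_{k+1} - 1}."""
    lo, hi = block_start(k, a, chi), block_start(k + 1, a, chi)
    return sum(l ** (-chi) for l in range(max(lo, 1), hi))


a, chi = 1.0, 0.5
for k in (10, 100, 1000):
    print(k, block_gain_sum(k, a, chi), "target:", a / (1.0 - chi))
```

With $\chi = 0.5$ and $a = 1$ the blocks are $\{k^2, \ldots, (k+1)^2 - 1\}$, and the block sums approach 2 as $k$ grows.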


Step 3: Convergence of remainder $w_n$ along a subsequence using block stability
of $A_k$.

By (20), (22), (24) and Wk := v k - Uk -+ 0 


$$w_{n+1} = \left(I - \frac{A_n}{n^\chi} + \eta_n I\right) w_n - \frac{Y_n}{n^\chi} u_n \quad \text{for } n = 1, 2, \ldots \qquad (38)$$


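so it follows by (38) that

$$w_n = \prod_{l=n_k}^{n-1}\left(I - \frac{A_l}{l^\chi} + \eta_l I\right) w_{n_k} - \sum_{j=n_k}^{n-1} \prod_{l=j+1}^{n-1}\left(I - \frac{A_l}{l^\chi} + \eta_l I\right)\frac{Y_j}{j^\chi} u_j \quad \text{for } n \ge n_k. \qquad (39)$$

In particular,

$$w_{n_{k+1}} = U_k w_{n_k} - \sum_{j \in I_k} V_{j,k} u_j \quad \text{for } k = 0, 1, \ldots, \qquad (40)$$

where $U_k$ is defined in (28) and

$$V_{j,k} = \prod_{l=j+1}^{n_{k+1}-1}\left(I - \frac{A_l}{l^\chi} + \eta_l I\right)\frac{Y_j}{j^\chi}. \qquad (41)$$

By Lemma 2 v) and (41) we obtain

$$\|V_{j,k}\| \le \prod_{l=j+1}^{n_{k+1}-1}\left\|I - \frac{A_l}{l^\chi} + \eta_l I\right\| \frac{\|Y_j\|}{j^\chi} \le \prod_{l \in I_k}\left(1 + \frac{\|A_l\|}{l^\chi} + \eta_l\right)\frac{\|Y_j\|}{j^\chi} \ll \frac{\|Y_j\|}{j^\chi} \quad \text{for } j \in I_k,\ k = 0, 1, \ldots \qquad (42)$$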
Therefore, by (37), (42), (28), (40), and (12) we have

k -1 


K 7 


k—ki 


I + E E 


+ 11711 


l — fei 


jeii 




Vfc > k\. (43) 


In addition, 


E 


Ii4i + ii4ii 

j x 


i= ii+iiE 




r 


jeh jeh jeli 

so using Lemma 2 iv), (26), (32) and finally applying Toeplitz Lemma, we obtain 

1141 + 11411 


lim > 

7 —VrYl ■ ^ 


Moreover, since 
k -1 


l—too L —' 7 

j eh 


l_ 7 fc-fcl k 


ix 


u 3 |= 0 . 


y 1 1 = —-< 1 for all k = fci, fci + 1 , ■ ■ ■ 


l—ki 


1-7 


(44) 


(45) 


it follows from (43), (44), (45) and the Toeplitz Lemma with ai tk = 7 fc 1 lfc 1 <i<fe-i 
and xi = I u j I that 


lim 


< lim 7 1 | w nk 

k—t oo 1 


fc -1 


lim V 7 *-‘- 7 M±!!M | |= o. (46) 


;=fci 


J6.fl 


J 


Step 4: Use $w_{n_k} \to 0$ to show block convergence $\max_{n \in I_k} |w_n| \to 0$.
Now, we return to (39) and find for $n \in I_k$


w< n 

l=n k 


I- 


4 

IX 


n—1 / 

n 

I-y- 

lx 

=j+i\ 





n—1 


sn (i+^+ukj+e n (i 


n—1 n—1 


;=n fc 




j=nfc l=n k 


II4II 

lx 


m 


Ill'll, 


iei k 


snfi+^+i^Kj+E 


\YA 


3 elk 


snii+^+uiK.i+E 


II4II +1141, 


iei k 


lx 


j elk 


(47)

Finally, by (47), (46), Lemma 2 v), and (44) we obtain

$$\lim_{k\to\infty} \max_{n \in I_k} |w_n| = 0. \qquad (48)$$ □
Part b) By (9) and (22), $z_k = k^\chi (v_{k+1} - v_k) + A_k v_k$. Averaging, then reordering
the sum, we have

^ n i / n n \ 


SO 


k=1 


n* 


v/c=l 




^ n i n 

= ^n+1 - — ” (* ” 1 ) X ) z/ fc + — 


fc=l 


fe=l 


- 

n* ^ 
k= 1 


< | Z'n+l | + E -—- - I V k 


fc= 1 




fc x 


+ E—II^H fcl_x KI- 


fc=l 


(49) 


The second and third terms on the RHS of (49) converge to 0 by the Toeplitz 

lemma with a n>fc = x k = Kl and with a n , k = k * JE 11 , x k = 

fc x- 1 | l/ fc| respectively. □ 


6. Appendix

We first establish our promised comparison of our conditions.


Lemma 1 limsup 

n—> oo 

in n. 


1 n 


k=1 


= 0 implies 1 ||Afc|| *s bounded 


n x 


fe=l 


Proof. By Lemma 3 (to follow) and the fact that $\sum_{k=1}^{n} k^{\chi-1} \le \frac{n^\chi}{\chi}$, one finds
that




n x 


fc=l 


d_ 

n x 


< 


< 


_d_ 

rA 

d 


E^ x_1 4 

fc= 1 
n 

E fc x_ 1 (A fc — A) 


fc=l 




fc=l 


Efc x_1 (^-^) 


fe=i 


d||A|| 


X 


(50) 


Hence, by the fact $\sum_{j=2}^{k} \left(j^{\chi-1} - (j-1)^{\chi-1}\right) = k^{\chi-1} - 1$, Taylor’s theorem and the
hypothesis


1 n 

iijj 


fe=l 
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< 


n x 


d 

< — 

n x 


n k 

EEt r-'-u-vx-^k-A) 

k=2 j=2 

n n 

Eo^-O’-^E^-^) 

1=2 fc=j 


_d_ 

n x 

C 


E(^- A ) 


fe=i 


dpii 


X 


< rf E 

i=2 


(j 1 ~ x - U -1) 1 -*) J_ 

ji-xQ - l)i-x n x 


< (j-!)-*(!-X) J_ 

" ^ j 1_x (j - !) 1_x ' 


1=2 


E(^- A ) 

k—j 

n 

E(^-^ 

fe=i 


=s 2J (i-x)£^ 


. 7—2 


n x 


E(^- A ) 


fe=i 


i 


c 

Y / (A k -A) 

fe =i 

i-i 


C 


rnrEA-D 


U - 1 ) 


fe=l 


■C 


where C = 


d\\A\\ 


1 

+ sup — 

X n n x 


E(^-- A ) 


fc=i 


< oo. This final term is bounded by 


the Toeplitz lemma and our hypothesis. □ 

We give our list of technical bounds used in the proof of Proposition 1.

Lemma 2 Assume the setting of Proposition 1, and that $I_k$, $F_{j,n}$ and
$\{Y_k\}_{k=1}^{\infty}$ are as defined in (27), (25) and (22), respectively. Then the following are true:


i) lim 


iii) lim 


k—too 


1=1 


A 


n 


= 0 



n—1 


n —1 

ii) lim 

n—>oo 

E F;j, nZ j 
1=1 

= 0 and lim 

n—>■ oo 

El Fj,nh 
1=1 


= 0 


E- 

lehc 

\\M\ 

l X 


= 0 


+ Ri ] -C 1 for all k = 0,1, 


iv ) E 

leik 

v) JJ 11 + EE + Vi \ 1 for all k = 0,1, • • 


iei k 


lx 


Proof. i) We know $\left\|I - \frac{A}{l^\chi} + \eta_l I\right\|$ is the maximum eigenvalue of $\left((1 + \eta_l) I - \frac{A}{l^\chi}\right)$


and 


0 < 


n—1 


1=1 


I (i + m)i- 


IX 


n—1 


(l+rn)I- 


lx 
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Let D £ R and A min > 0 be the minimum eigenvalue of A and l* be large 
enough that: 1 + r]i — ; + IL > 0, V l > l* so 


n 

i=i* 


(1 + rji)I ~ 


< 


< 


n i+ 


;=;* 


V A r 


l e l x 


I I ^ + 

ex p z, U - 


i=i* 
n—1 


l e l x 


< exp 




An 


;*-i 


zx 


-dx 


V i\l—0 A min 1 —x 


< exp^+^n-l) 1 " 9 -^ 

'—A 


X 


<C exp 


'mm 1—v 

- n X 


2 - 2 X 


Hence, 


n — 1 

n 




0 as n —> oo. 


(51) 


ii) \\(r x +r]rr x -(r-l) x )I-A\\ <| (r x -(r-l) x ) | +rjr x £l + || 2 l|| < l+ 77 +||H|| 
is upper bounded Vr > 1 since x £ (0,1). Hence, by (25) we have 


n—1 / 

n U i+r h)i- 

l=r-\-1 


< 


Z—r+l 
n—1 


n —1 ✓ 

n f(i+»?o / - 

—r+l ' 


+ 

1a, 


Z—r+l ' 

(r^l) x )/-+| 
-1 


(r - l)x + 1 + r)r ^ 1 r x / 

1 

r x (r - l)x 


{r x + rj r 


r,n 1 

^ r x {r — l)x 


n 

Z—r+l ' y 


( 52 ) 
for all $r = 2, 3, \ldots, n-1$, $n = 3, 4, \ldots$. Letting $\lambda$ denote an arbitrary eigenvalue
of $A$ and $L^c = \{l : \frac{\lambda}{l^\chi} - 1 - \eta_l > c\}$, we have that


l=r+1 ' 




Z* 


leL 1 


n—1 


^ IT ( ^ - 1 - *n) x ex P ( e »»- ^ 




iei 1 


W 6 L° 


2=r+l,2gL° 

A_ 

1* 


X 

r,n / fj A ^ 

« “p| E jj-p? 

\2=r+l y 


< exp ^-(r + l) 1 x }^ 


(53) 


and it follows from (53), the fact that the eigenvectors of $A$ span $\mathbb{R}^d$ and the
principle of uniform boundedness that


n—1 

n ((1 + m)i- 

l=r-\-1 




(54) 


It follows by (25), (52) and (54) that 


n— 1 


n— 1 


E>-i)*iiF r _ lin -F r , n || « E- 


i 


e 2 - 2 * 


r=2 


r x 

r—2 

X m.in 1-X 


< e 2 - 2 * 


/ —e 2 - 2 * 4 dt 

'2 t X 


« 1 Vn = 3,4,... 

n—1 n—1 n—1 / n—1 

Next, ^ ^ Fj, n Zj = ^ ^ F n -l,nZj ^ ^ I ^ ^ -^r—l,n F r ^ n J £7 9 - 11(1 
1=1 1 =1 1 =1 \ r = l + 1 


(55) 


n—1 n—1 

E E (F r — i ?n 

1=1 r=l+l 
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E K-i, 
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F r . 
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.7=1 





















Therefore, by assumption, (25), (55) and Toeplitz’ lemma with $a_{n,r} = (r-1)^\chi \|F_{r-1,n} - F_{r,n}\|$
and $x_r = \frac{1}{(r-1)^\chi}\left|\sum_{j=1}^{r-1} z_j\right|$ we have:


n—1 

E Fj,n z j 

3=1 

.< l,n || 

n—1 

E -i 

i=i 

+ 

71—1 71—1 

E E (^r-l.n - 

3=1 r=3+l 

Fr,n)Zj 


1 

«< 

n—1 

E^ 

3=1 

n—1 

+ ^ ] H-^r—l,n — -^r,n|| 
r—2 

r—1 

E^ 

3 = 1 


~ (n— l)x 


0.(56) 


as $n \to \infty$. Turning to the second limit in ii), we have by (25) and (54) that

n—1 n—1 




(57) 


3 =1 


3=1 


However, 

n— 1 


3 =1 


J x 


- e 2 - 2 X V L ■' ^“ r - L ^ ^ ^ e 2 “ 2 X 

ir 


f 1 

J 1 * x 


< 1 


for all n so the second limit in ii) follows by the Toeplitz lemma. 
l — l / \ 

Hi) Since £ = ^ + E ~ V 1 e /fe ’ one has that 


(58) 


E 

ieh 


1 

< — 

n k 


E^ 


E 

r<Z 

r.ien 


(r + l)x r* 




Hence, by Taylor’s theorem 


E 

lehc 
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1 


— Y 

n,: 


L k 

n k+ 1-2 

E 

r=n k 


< — 


E * 

l<n k+ 1 


- (r + 1 )> 
r*(r + l)x 


E * 

l<n k+ 1 


E^ 

Kn k 


E *-E* 

l<n fc+ i l<r 


E^ 

Kn k 
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r&I k 
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E * 

l<n k+ 1 


E^ 

l<r 


(59) 
where the summations all start from $l = 1$ and stop at $l = n_k - 1$, $r$ or $n_{k+1} - 1$.
Furthermore, by the hypothesis and (27) we have that


lim max — 

k —>oo rg/fc 


E * 

Kn k+ 1 


= 0 


(60) 


and the first two terms on the RHS of (59) go to zero. Moreover, by (27) 

E " < 

r V njfc-l / 


rGlk 


, / L(a(fc + 1))—J -1 , n . 

= log I -- 5 - | —> 0 as k —> oo 


L(afc) 1 -xJ — 1 


(61) 


due to the fact that 


L(afc) 1 -x J — 1 (ak) 1 -*- 2 l-(^) 1 “ x 


In addition, by assumption, (60) and (61) 


E 

relk 


7*X+1 


E * 


fc V 1 1 

^ r r n k 
relk K 


E * 

I<n fc+ i 


0 as k —> oo 


and 


E 

relk 


X 


pX+i 


M 

< y 

< J r rX 

E^ 

l<r 

relk 

l<r 


0 as k —^ oo 


Hence, the last term on the RHS of (59) goes to zero too. 

iv) By Lemma 3, the fact that $\|B\| \le |||B||| \le \sqrt{d}\,\|B\|$ for a matrix with rank
$d$, iii) and (32) we have
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leik 


Al 

lx 

W2 

VI 

Ai 

lx 

< \fd 

Ai 

2^ 7^ 
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2^ ] x 
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■d||A||E 
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(62) 


Moreover, by (33) Ri < R ^ —> 0 as k 


iei k 


ieik 
v) This follows by iv) and the fact that



< 1, V fc = 0,1,... □ (63) 

The following lemma is taken from Kouritzin [22]. 

Lemma 3 Suppose $m$ is a positive integer and $\{M_k, k = 1, 2, 3, \ldots\}$ is a sequence
of symmetric, positive semidefinite $\mathbb{R}^{m \times m}$-matrices. Then, it follows that

$$\sum_{k=1}^{j} |||M_k||| \le \sqrt{m}\, \left|\left|\left| \sum_{k=1}^{j} M_k \right|\right|\right| \quad \forall j = 1, 2, 3, \ldots$$
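
A bound of this type can be spot-checked numerically. The sketch below assumes that $|||\cdot|||$ denotes the Frobenius norm and that the inequality reads $\sum_{k \le j} |||M_k||| \le \sqrt{m}\, |||\sum_{k \le j} M_k|||$; both are assumptions about the garbled statement. The test matrices are random rank-one matrices $v v^T$, which are symmetric positive semidefinite by construction.

```python
import math
import random


def outer(v):
    """Rank-one symmetric PSD matrix v v^T."""
    return [[a * b for b in v] for a in v]


def frobenius(M):
    return math.sqrt(sum(x * x for row in M for x in row))


def add(M, N):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(M, N)]


def check_bound(m, count, seed=0):
    """Check the assumed bound for every partial sum j = 1..count."""
    rng = random.Random(seed)
    total = [[0.0] * m for _ in range(m)]
    lhs = 0.0
    for _ in range(count):
        v = [rng.uniform(-1.0, 1.0) for _ in range(m)]
        M = outer(v)
        lhs += frobenius(M)
        total = add(total, M)
        if lhs > math.sqrt(m) * frobenius(total) + 1e-12:
            return False
    return True


print(check_bound(3, 50))
```

For positive semidefinite matrices the Frobenius norm is at most the trace, the trace is additive, and the trace of the sum is at most $\sqrt{m}$ times its Frobenius norm, which is why the check is expected to pass for every $j$.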